Study guide
PSY715 – Research design & statistics
This guide covers aspects of the class that are important and not self-evident; it should not be treated as the only document to review.
This document is updated regularly. Check back soon!
1 Research design
1.1 Types of research design
1.1.1 Quantitative vs. Qualitative
- Quantitative research design
Involves the collection and analysis of numerical data using statistical methods. Emphasizes objective measurements, standardized instruments, and statistical analyses to establish relationships and make generalizations.
- Qualitative research design
Focuses on understanding and interpreting subjective experiences, meanings, and social contexts. Utilizes methods such as interviews, observations, and textual analysis to generate rich and in-depth descriptions and explore complex phenomena.
1.1.2 Observational vs. Experimental
- Observational research design
Involves observing and documenting behavior or phenomena in their natural settings without any intervention or manipulation by the researcher. Focuses on describing and understanding relationships, patterns, and behaviors as they naturally occur.
- Experimental research design
Involves manipulating variables and randomly assigning participants to different conditions to establish cause-and-effect relationships. Allows researchers to control and manipulate independent variables while measuring the effects on dependent variables to make causal inferences.
1.1.3 Inductive vs. Deductive
- Inductive Research Design:
- Starts with observations and data.
- Generates theories or generalizations based on patterns or trends identified in the data.
- Moves from specific observations to broader conclusions or theories.
- Deductive Research Design:
- Starts with theories, hypotheses, or existing knowledge.
- Tests specific hypotheses or predictions derived from theories.
- Involves collecting data to confirm or refute the hypotheses.
- Moves from general theories or hypotheses to specific observations.
1.2 Roles of variables
- Independent Variable (IV):
- Manipulated or controlled by the researcher.
- Changes intentionally.
- Hypothesized cause or predictor.
- Dependent Variable (DV):
- Measured or observed outcome.
- Affected by the independent variable.
- Variable of interest.
- Extraneous Variable:
- Variable(s) that may influence the relationship between the IV and DV.
- Not intentionally manipulated or controlled by the researcher.
- Need to be identified and controlled to ensure accurate interpretation of the relationship between the IV and DV.
1.3 Levels of measurement
- Nominal:
- Categorical data without any inherent order or numerical value.
- Examples: Gender, eye color, marital status.
- Ordinal:
- Categorical or ordered data with a relative ranking or order.
- Categories have a meaningful order, but the differences between them may not be equal.
- Examples: Likert scales, educational levels (e.g., elementary, middle, high school).
- Interval:
- Numerical data with equal intervals between values.
- No true zero point.
- Arithmetic operations like addition and subtraction can be performed.
- Examples: Temperature in Celsius or Fahrenheit.
- Ratio:
- Numerical data with equal intervals between values and a true zero point.
- All arithmetic operations can be performed.
- Examples: Height, response time, income.
1.4 Qualities of research designs
- Internal validity:
- Refers to the extent to which a study accurately measures the cause-and-effect relationship between variables.
- Involves controlling potential confounding factors and ensuring that changes in the dependent variable are due to the manipulation of the independent variable.
- External validity:
- Represents the generalizability of research findings to the broader population or real-world contexts.
- Considers factors such as sample representativeness, research settings, and the ecological validity of the study.
- Ethical soundness:
- Involves ensuring that research adheres to ethical principles and guidelines.
- Includes obtaining informed consent from participants, protecting their rights and privacy, and minimizing any potential harm or discomfort.
- Measurement reliability:
- Refers to the consistency and stability of measurement or data collection procedures.
- Measurement validity:
- Represents the extent to which a study measures what it intends to measure or assess.
1.5 Sampling biases
- Convenience sampling bias:
- Occurs when participants are selected based on their availability or convenience.
- Can lead to a non-representative sample that may not accurately reflect the target population.
- Self-selection bias:
- Arises when individuals voluntarily choose to participate in a study.
- Can introduce bias as those who self-select may have unique characteristics or motivations that differ from the general population.
- Sampling bias due to non-response:
- Occurs when selected participants decline or fail to respond to the study’s invitation.
- Can result in a biased sample if non-responders have different characteristics than responders.
- Attrition bias:
- Occurs when there is a differential loss of participants or data during the course of a study.
- Can introduce bias if attrition is related to the variables being studied and can affect the representativeness and validity of the findings.
- Sampling bias due to small sample size:
- Occurs when the sample size is too small to represent the target population adequately.
- Findings based on small samples may not generalize well and can be more susceptible to chance variations.
- Sampling bias due to non-random selection:
- Arises when participants are not randomly selected from the population of interest.
- Can lead to a sample that is not representative and limits the generalizability of the study’s findings.
2 Univariate statistics
2.1 Location / Central tendency
Measures of location inform us about the typical observation for a variable.
2.1.1 Mode
The mode is the observation with the largest frequency.
2.1.2 Median
The median is the observation that splits the ordered observations into two equal-sized groups. It is also the 50th percentile and the 2nd quartile.
2.1.3 Mean
The mean \(\bar{y}\) is the sum of observations over the sample size.
\[\bar{y} = \frac{\sum y}{N}\]
Its value in the population is by convention written \(\mu\).
2.2 Measures of dispersion
Measures of dispersion inform us about how scattered the observations are. They consequently also inform us about how accurate measures of location are.
2.2.1 Range
The range is the distance between the maximum and the minimum.
2.2.2 Standard deviation
The standard deviation \(s\) is the typical distance to the mean.
\[s = \sqrt{ \frac{\sum (y - \bar{y})^2}{N-1}}\]
Its value in the population is by convention written \(\sigma\).
2.2.3 Variance
The variance \(s^2\) is the square of the standard deviation.
\[s^2 = \frac{\sum (y - \bar{y})^2}{N-1}\]
Its value in the population is by convention written \(\sigma^2\).
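These formulas can be checked with a short sketch (the data here are hypothetical; note the \(N-1\) denominator for the sample standard deviation):

```python
import numpy as np

# Hypothetical sample of N = 5 observations
y = np.array([4.0, 7.0, 6.0, 5.0, 8.0])
N = len(y)

mean = y.sum() / N                            # ȳ = Σy / N
variance = ((y - mean) ** 2).sum() / (N - 1)  # s², with N - 1 in the denominator
sd = np.sqrt(variance)                        # s, the standard deviation

# np.mean(y), np.var(y, ddof=1), and np.std(y, ddof=1) give the same values
```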
3 Linear models
3.1 General formulation and structural assumption
Linear Models (LM) are a class of models in which an outcome variable \(Y\) is predicted as a linear function of \(p\) predictor variables \(X_1, X_2,...,X_p\).
For the \(i^\text{th}\) case, we have:
\[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + ... + \beta_px_{pi} + \epsilon_i\]
\(\beta_0\), \(\beta_1\), …, \(\beta_p\) are fixed real numbers to estimate, called model parameters.
\(\beta_0\) is referred to as the intercept.
\(\beta_1\), \(\beta_2\), …, \(\beta_p\) are referred to as the slope(s).
3.2 Distributional assumption
The errors are assumed to be randomly drawn from a Gaussian (Normal) distribution of fixed variance \(\sigma^2\):
\[\epsilon_i \sim \text{Normal}(0,\sigma^2)\]
Alternatively, we can write that the distribution of \(y\), conditional upon predictor values \(x\) follows a Normal distribution of fixed variance \(\sigma^2\):
\[y|x_i \sim \text{Normal}(\mu_i,\sigma^2)\]
3.3 Important particular cases
3.3.1 Intercept-only models
The intercept-only model (or grand mean model, or mean model) is a model that predicts the outcome as a constant (the intercept \(\beta_0\)).
\[y_i = \beta_0 + \epsilon_i\]
The one-sample \(t\) test and the paired-samples \(t\) tests can be seen as intercept-only linear models.
3.3.2 Models with only one slope
Models with an intercept \(\beta_0\) and only one slope (and thus one predictor) \(\beta_1\) also have particularities.
\[y_i = \beta_0 + \beta_1x_{1i} + \epsilon_i\]
Simple regression / Pearson correlation \(r\) and the independent-samples \(t\) test are models with an intercept and a single slope.
3.4 Relation to the mean
The predicted/expected value for \(y\) – noted \(\hat{y}\) or \(E(y)\) – is equal to the sample mean of \(y\) (conditional on the predictors, if any). It is also the maximum likelihood estimator of the population mean (conditional on the predictors, if any).
Thus, a linear model is essentially a model that predicts the mean (of a Normal distribution).
3.5 Parameter estimation
In Linear Models, parameters are in general estimated through Ordinary Least Squares, which consists of finding the parameter values that minimize the sum of squared errors (\(SSE\) or \(SSR\)).
3.6 Testing parameters
In general, in linear models, parameters can be significance tested against some comparison value \(\beta_{H_0}\) (most of the time \(0\)), using a \(t\) test (often referred to as a Wald \(t\) test).
The null hypothesis is specified as:
\[H_0 : \beta = \beta_{H_0}\]
The (non-directional) alternate hypothesis is thus:
\[H_1 : \beta \neq \beta_{H_0}\]
Under \(H_0\), the difference between the sample estimate \(\hat{\beta}\) and the comparison value \(\beta_{H_0}\), divided by the standard error of \(\beta\), follows a Student’s \(t\) distribution with degrees of freedom \(df = N - p - 1\) (note: \(p\) is here the number of predictors in the model).
\[t = \frac{\hat{\beta} - \beta_{H_0}}{SE_{\beta} } \sim t(N-p-1)\]
The \(p\) value is then the probability that \(t\) is greater than \(t_\text{observed}\) in absolute value (i.e., the probability of values above \(|t_\text{observed}|\) or below \(-|t_\text{observed}|\)).
A \(p\) value smaller than \(.05\) allows us to conclude that the parameter differs significantly from the comparison value. A \(p\) value larger than \(.05\) does not allow any conclusion regarding that difference.
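As a sketch of this computation (the estimate, standard error, and sample size below are hypothetical), the two-tailed \(p\) value can be obtained from the \(t\) distribution with SciPy:

```python
from scipy import stats

# Hypothetical slope estimate, standard error, sample size, and predictor count
beta_hat = 0.45
se_beta = 0.18
beta_h0 = 0.0       # comparison value (most of the time 0)
N, p = 40, 2

t_stat = (beta_hat - beta_h0) / se_beta    # Wald t statistic
df = N - p - 1
p_value = 2 * stats.t.sf(abs(t_stat), df)  # P(|t| > t_observed), two-tailed
```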
3.7 Sums of squares
Sums of squares are used to decompose the variability of the outcome variable (i.e., to do an “analysis of variance”).
The sum of squared errors \(SSE\) quantifies the distance between the predictions of the model \(\hat{y}\) and the observations \(y\): \(SSE = \Sigma(\hat{y} - y)^2\)
The total sum of squares \(SST\) quantifies the distance between the predictions of the simplest model possible (the intercept-only model, or mean model) \(\bar{y}\) and the observations \(y\): \(SST = \Sigma(\bar{y} - y)^2\)
The model sum of squares \(SSM\) quantifies the distance between the predictions of the model \(\hat{y}\) and the predictions of the simplest model possible (the intercept-only model, or mean model) \(\bar{y}\): \(SSM = \Sigma(\hat{y} - \bar{y})^2\)
The intercept-only model has an \(SSM=0\) and an \(SSE = SST\). Thus the \(SST\) is the \(SSE\) of an intercept-only model.
3.8 Coefficient of determination \(R^2\)
\(R^2\) is used to determine the fit of the model to the data. It is computed as the variability explained by the model \(SSM\) over the total variability to explain \(SST\):
\[R^2 = \frac{SSM}{SST}\]
\(R^2\) ranges from \(0\) (lowest fit possible) to \(1\) (perfect fit).
\(R^2\) is generally interpreted as the proportion of variance of \(Y\) explained by the model.
For the intercept-only model, \(R^2=0\). It is trivial and thus not reported.
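A minimal sketch with hypothetical data, fitting a simple regression with NumPy and decomposing the variability (for a least-squares fit with an intercept, \(SSM + SSE = SST\)):

```python
import numpy as np

# Hypothetical predictor and outcome
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

b1, b0 = np.polyfit(x, y, 1)        # OLS slope and intercept
y_hat = b0 + b1 * x                 # predictions of the model
y_bar = y.mean()                    # prediction of the intercept-only model

sse = ((y_hat - y) ** 2).sum()      # error variability
sst = ((y_bar - y) ** 2).sum()      # total variability (SSE of the mean model)
ssm = ((y_hat - y_bar) ** 2).sum()  # variability explained by the model

r2 = ssm / sst                      # coefficient of determination
```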
3.9 \(F\) test of model fit
Model fit can be significance tested. We specify a null hypothesis that indicates that the model does not predict the data (more than an intercept-only model) in the population:
\[H_0 : R^2 = 0\]
As a result, the alternate hypothesis is:
\[H_1 : R^2 > 0\]
Under \(H_0\), the ratio of the mean squares of the model (\(SSM/df_M\)) over the mean squares of the errors (\(SSE/df_E\)), named \(F\), follows a Fisher’s \(F\) distribution, of degrees of freedom \(p\) and \(N-p-1\):
\[F = \frac{SSM/df_M}{SSE/df_E} = \frac{MSM}{MSE} \sim F(p, N-p-1)\]
The \(p\) value is then the probability that \(F\) is greater than \(F_\text{observed}\) (the \(F\) test is one-tailed: only large values of \(F\) indicate a better fit).
A \(p\) value smaller than \(.05\) allows us to conclude that the model fits significantly better than the intercept-only model (we generally say that the model “significantly fits”, or that \(R^2\) is significant). A \(p\) value larger than \(.05\) does not allow any conclusion regarding model fit.
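Continuing with hypothetical sums of squares, the \(F\) statistic and its \(p\) value can be sketched as:

```python
from scipy import stats

# Hypothetical fit: N cases, p predictors, model and error sums of squares
N, p = 30, 2
ssm, sse = 40.0, 60.0

df_m, df_e = p, N - p - 1
f_stat = (ssm / df_m) / (sse / df_e)      # MSM / MSE
p_value = stats.f.sf(f_stat, df_m, df_e)  # P(F > F_observed) under H0
```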
4 Testing linear model assumptions
4.1 Conditional normality
4.1.1 What we study
To test conditional normality, depending on the model, we generally study one or more distributions.
If there are no predictors (i.e., intercept-only model), we directly study the distribution of \(Y\).
If the predictor \(X\) is a discrete variable, we can study the distribution of \(Y\) for each (observed) value of \(X\). The ensemble of these distributions of \(Y\) at each \(X\) is referred to as the conditional distribution of \(Y\).
If the predictor \(X\) is a continuous variable, we generally study the distribution of the residuals of the model.
4.1.2 How we study it
Distributions can be studied for normality, notably using:
- histograms, frequency plots, density plots
- normality tests (e.g., Kolmogorov-Smirnov test, Shapiro-Wilk test)
- Should be non-significant (a significant test indicates a significant departure from normality)
- measures of skewness and kurtosis
- Should be close to \(0\)
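A sketch of these checks on simulated residuals, using SciPy’s `shapiro`, `skew`, and `kurtosis` (the data are drawn from a Normal distribution, so no departure is expected):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)  # simulated model residuals

w_stat, p_value = stats.shapiro(residuals)  # significant p => departure from normality
skewness = stats.skew(residuals)            # should be close to 0
kurt = stats.kurtosis(residuals)            # excess kurtosis; close to 0 for a Normal
```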
4.2 Homogeneity of variance
4.2.1 What we study
To test homogeneity of variance (i.e., homoscedasticity), we use different procedures depending on the model. In general:
If there are no predictors (i.e., intercept-only model), homogeneity of variance is trivial (the variance cannot vary as a function of predictors if there isn’t any).
If the predictor \(X\) is a discrete variable, we can study the distribution of \(Y\) (or of the residuals) for each (observed) value of \(X\) (if the assumption is true, we should not have different variances of \(Y\) for different values of \(X\)).
If the predictor \(X\) is a continuous variable, we generally study how the residuals may vary as a function of the predictor (if the assumption is true, we should see no relation).
4.2.2 How we study it
Homogeneity of variance is notably investigated using the following tools:
- Levene’s test (compares variances across independent groups). A significant test indicates significantly different variances (i.e., violated assumption).
- A box-plot/density plot/histogram by group is often used with it.
- Auxiliary regression (a regression model where the residuals are predicted using the model predictors). A significant test implies a variance that significantly varies as a function of the predictors (i.e., violated assumption).
- A residuals by predicted plot is often used with it.
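As a sketch, Levene’s test on two hypothetical groups with SciPy (a small \(p\) would indicate unequal variances):

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two independent groups
group_a = np.array([4.0, 5.0, 6.0, 5.0, 4.0])    # low spread
group_b = np.array([1.0, 9.0, 2.0, 10.0, 3.0])   # high spread

w_stat, p_value = stats.levene(group_a, group_b)  # default center='median'
```

With groups this small the test has little power, which is one reason it is usually paired with a box plot or density plot by group.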
5 Intercept-only models
The following models can be discussed as applications of intercept-only linear models. An intercept-only model is formulated as:
\[y_i = \beta_0 + \epsilon_i\]
Some important notes regarding intercept-only linear models:
- the population mean of \(Y\), often noted \(\mu\), is equal to the population intercept \(\beta_0\).
- the (Maximum Likelihood / Ordinary Least Squares) estimate of the intercept \(\hat{\beta}_0\) is equal to the sample mean \(\bar{y}\).
5.1 One sample \(t\) test
The one-sample \(t\) test is used for situations where we want to compare the mean of a variable \(Y\) with some theoretical value of interest \(\mu_0\) (some reference value, some general population value, a threshold, the central point in a scale, etc.).
5.1.1 Mean differences
A common way to describe the difference between the mean \(\bar{y}\) (which is the sample mean and the estimator for the population mean) and the theoretical value \(\mu_0\) is to compute a (raw) mean difference \(\Delta \bar{y}\):
\[\Delta \bar{y} = \bar{y} - \mu_0\]
This difference is expressed in the original units of the variable \(Y\): a mean difference of \(.4\) means that the mean \(\bar{y}\) is \(.4\) units higher than the reference value \(\mu_0\).
But in many cases in the social sciences, these units are arbitrary and/or meaningless. Therefore, it is common to express the mean difference in standard deviations. For this, we compute the Standardized Mean Difference (\(SMD\), also referred to as Cohen’s \(d\)):
\[SMD = d= \frac{\Delta \bar{y}}{\sigma}\]
If the population standard deviation \(\sigma\) is known (in practice, it is rarely the case), it is used in the formulation. More frequently, we replace it with its estimator, which is the sample standard deviation \(s\).
A standardized mean difference of \(.2\) means that the mean \(\bar{y}\) is \(.2\) standard deviations higher than the reference value \(\mu_0\).
The standardized mean difference is the most common measure of effect size in this context.
| Cohen’s d (absolute value) | Interpretation |
|---|---|
| 0.00 to 0.19 | Negligible effect size |
| 0.20 to 0.39 | Small/weak effect size |
| 0.40 to 0.69 | Medium/Moderate effect size |
| 0.70 and above | Large/Strong effect size |
5.1.2 Null hypothesis
In a one-sample \(t\) test, the null hypothesis is formulated so as to imply no mean difference in the population:
\[H_0 : \mu = \mu_0\]
Alternatively, we can write:
\[H_0 : \mu - \mu_0 = 0\]
5.1.3 Alternate hypothesis
The non-directional hypothesis states that the mean differs from the reference value in the population:
\[H_1 : \mu \neq \mu_0\] Which can be written as:
\[H_1 : \mu - \mu_0 \neq 0\]
A directional alternate hypothesis specifies a direction for that inequality:
\[H_1' : \mu > \mu_0 \text{ or } H_1' : \mu < \mu_0\]
If using a directional hypothesis, the direction must be specified prior to analysis.
5.1.4 The \(t\) statistic
Under \(H_0\), the mean difference is null in the population, which implies that the following \(t\) statistic…
\[t = \frac{\bar{y} - \mu_0}{SE_\bar{y}} = \frac{\bar{y} - \mu_0}{s/\sqrt{N}}\]
…follows a Student’s \(t\) distribution in the sample, with degrees of freedom \(df = N - 1\)
\[t \sim t(N-1)\]
\(SE_\bar{y}\) is the standard error of the mean.
5.1.5 \(p\) value
The observed \(t\) value is located in the \(t(N-1)\) distribution. The probability that \(t\) is greater in absolute value than the \(t_\text{observed}\) is the \(p\)-value.
- If \(p<.05\), we reject \(H_0\), and therefore conclude that the population mean is different from \(\mu_0\) (i.e., there is a significant mean difference)
- If \(p>.05\), we cannot reject (or confirm) the null hypothesis, and thus cannot conclude (i.e., the mean difference is non-significant).
For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu > \mu_0\), the \(p\)-value is the probability that \(t\) is greater than \(t_\text{observed}\). For a one-tailed test assuming \(\mu < \mu_0\), the \(p\)-value is the probability that \(t\) is smaller than \(t_\text{observed}\).
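The whole procedure can be sketched with SciPy on hypothetical scores (the data and reference value are illustrative):

```python
import numpy as np
from scipy import stats

# Hypothetical scores, compared against a reference value mu0
y = np.array([5.1, 4.8, 5.6, 5.9, 5.2, 5.4, 4.9, 5.7])
mu0 = 5.0

t_stat, p_two = stats.ttest_1samp(y, popmean=mu0)  # two-tailed by default
d = (y.mean() - mu0) / y.std(ddof=1)               # Cohen's d, using sample s
```

Recent SciPy versions also accept an `alternative='greater'`/`'less'` argument for directional (one-tailed) hypotheses.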
5.1.6 Relation to linear models
The one-sample \(t\) test is equivalent to the Wald \(t\) test of the intercept parameter in an intercept-only model, in which the comparison value \(\beta_{H_0} = \mu_0\).
\[y_i = \beta_0 + \epsilon_i\]
In a Wald test of \(\beta_0\), we would have:
\[t = \frac{\hat{\beta}_0 - \beta_{H_0}}{SE_{\beta_0}}\]
Since \(\beta_0\) corresponds to the population mean, \(\hat{\beta}_0\) to its sample estimate, and \(\beta_{H_0} = \mu_0\), we have:
\[t = \frac{\hat{\beta}_0 - \beta_{H_0}}{SE_{\beta_0}} = \frac{\bar{y} - \mu_0}{SE_{\bar{y}}} \]
5.1.7 Assumptions
The assumption of normality is here directly tested using the sample distribution of \(Y\), using the usual tools (density plot, histogram, Shapiro-Wilk test, etc.).
The assumption of homogeneity of variance is not tested because there is no predictor that would make variance heterogeneous.
5.2 Paired-samples \(t\) test
The paired-samples \(t\) test is used for situations where we want to compare two repeated measures (e.g., a measure taken at two time points on the same persons). We will name these repetitions \(1\) and \(2\) throughout.
5.2.1 Paired differences
Let us note \(\Delta Y\) the paired-differences variable, which is defined as the difference between the two repetitions \(1\) and \(2\), such that:
\(\Delta Y = Y_2 - Y_1\)
Some software compute \(\Delta Y\) as \(Y_1-Y_2\), others as \(Y_2-Y_1\).
| Case | \(Y_1\) | \(Y_2\) | \(\Delta Y = Y_2 - Y_1\) |
|---|---|---|---|
| 1 | 10 | 15 | 5 |
| 2 | 12 | 18 | 6 |
| 3 | 8 | 11 | 3 |
| 4 | 9 | 14 | 5 |
| 5 | 11 | 16 | 5 |
| … | … | … | … |
5.2.2 Mean differences
A common way to describe the difference between the two means \(\bar{y}_1\) and \(\bar{y}_2\) is to compute a (raw) mean difference \(\overline{\Delta y}\):
\[\overline{\Delta y} = \frac{\sum \Delta y}{N}\]
This difference is expressed in the original units of the variable \(Y\).
The difference of two means \(\bar{y}_2 - \bar{y}_1\) is equal to the mean of the differences \(\overline{\Delta y}\). This is not true of many other statistics however (e.g., median, standard deviation). Technically here, we are studying the mean of the differences.
In many cases in the social sciences, these units are arbitrary and/or meaningless. Consequently, it is common to express the mean difference in standard deviations. To do this, we compute the Standardized Mean Difference (\(SMD\), also referred to as Cohen’s \(d\)):
\[SMD = d= \frac{\overline{\Delta y}}{s_{\Delta y}}\]
For a better estimation of the standardized mean difference, we would prefer to use the population standard deviation of the differences \(\sigma_{\Delta y}\). In practice, however, it is rarely known. Therefore, its estimator, the sample standard deviation of the differences \(s_{\Delta y}\), is used instead.
The standardized mean difference is the most common measure of effect size in this context.
| Cohen’s d (absolute value) | Interpretation |
|---|---|
| 0.00 to 0.19 | Negligible effect size |
| 0.20 to 0.39 | Small/weak effect size |
| 0.40 to 0.69 | Medium/Moderate effect size |
| 0.70 and above | Large/Strong effect size |
5.2.3 Null hypothesis
In a paired-samples \(t\) test, the null hypothesis is formulated so as to imply no mean difference in the population:
\[H_0 : \mu_1 = \mu_2\]
Alternatively, we can write:
\[H_0 : \mu_2 - \mu_1 = 0\]
5.2.4 Alternate hypothesis
The non-directional hypothesis states that the two means differ in the population:
\[H_1 : \mu_1 \neq \mu_2\] Which can be written as:
\[H_1 : \mu_2 - \mu_1 \neq 0\]
A directional alternate hypothesis specifies a direction for that inequality:
\[H_1' : \mu_1 > \mu_2 \text{ or } H_1' : \mu_1 < \mu_2\]
If using a directional hypothesis, the direction must be specified prior to analysis.
5.2.5 The \(t\) statistic
Under \(H_0\), the mean difference is null in the population, which implies that the following \(t\) statistic…
\[t = \frac{\overline{\Delta y}}{SE_{\Delta Y}} = \frac{\overline{\Delta y}}{s_{\Delta y}/\sqrt{N}}\]
…follows a Student’s \(t\) distribution in the sample, with degrees of freedom \(df = N - 1\).
\[t \sim t(N-1)\]
\(SE_{\Delta Y}\) is the standard error of the mean of the differences.
5.2.6 \(p\) value
The observed \(t\) value is located in the \(t(N-1)\) distribution. The probability that \(t\) is greater in absolute value than the \(t_\text{observed}\) is the \(p\)-value.
- If \(p<.05\), we reject \(H_0\), and therefore conclude that the means are different in the population (i.e., there is a significant mean difference)
- If \(p>.05\), we cannot reject (or confirm) the null hypothesis, and thus cannot conclude (i.e., the mean difference is non-significant).
For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu_2 > \mu_1\) the \(p\)-value is the probability that \(t\) is greater than \(t_\text{observed}\). For a one-tailed test (i.e., directional alternate hypothesis) assuming \(\mu_2 < \mu_1\) the \(p\)-value is the probability that \(t\) is smaller than \(t_\text{observed}\).
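Using the five cases listed in the paired-differences table above, the test can be sketched with SciPy:

```python
import numpy as np
from scipy import stats

# Data from the paired-differences table (cases 1-5)
y1 = np.array([10.0, 12.0, 8.0, 9.0, 11.0])
y2 = np.array([15.0, 18.0, 11.0, 14.0, 16.0])

dy = y2 - y1                             # paired differences, ΔY = Y2 - Y1
t_stat, p_two = stats.ttest_rel(y2, y1)  # same t as ttest_1samp(dy, 0)
d = dy.mean() / dy.std(ddof=1)           # Cohen's d for paired data
```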
5.2.7 Relation to linear models
The paired-samples \(t\) test is equivalent to the Wald \(t\) test of the intercept parameter in an intercept-only model, in which the comparison value \(\beta_{H_0} = 0\), and in which the predicted variable consists of the differences \(\Delta Y\):
\[\Delta y_i = \beta_0 + \epsilon_i\]
In a Wald test of \(\beta_0\), we would have:
\[t = \frac{\hat{\beta}_0 - \beta_{H_0}}{SE(\beta_0)}\]
Since \(\beta_0\) corresponds to the population mean (of \(\Delta Y\)), \(\hat{\beta}_0\) to its sample estimate (i.e., \(\overline{\Delta Y}\)), and \(\beta_{H_0} = 0\), we have:
\[t = \frac{\hat{\beta}_0 - \beta_{H_0}}{SE(\beta_0)} = \frac{\overline{\Delta Y}}{SE_{\Delta Y}} \]
5.2.8 Assumptions
The assumption of normality is here directly tested using the sample distribution of \(\Delta Y\), using the usual tools (density plot, histogram, Shapiro-Wilk test, etc.).
The assumption of homogeneity of variance is not tested because there is no predictor that would make variance heterogeneous.
5.2.9 Common graphical representations
- Box plots
- Means with 95% Confidence Intervals
- Observations connected by case (“Spaghetti plot”)
6 Linear regression
6.1 Simple linear regression
Simple linear regression is used for situations where we have one numeric predictor (IV) \(X\) used to predict one numeric outcome (DV) \(Y\).
The relation between the two variables is assumed to be linear:
\[y_i = \beta_0 + \beta_1x_{i} + \epsilon_i\]
Simple regression is thus a Linear Model.
6.1.1 Parameter estimation
As in all linear models, the intercept and the slope are estimated through ordinary least squares, which consists of minimizing the Sum of Squared Errors (\(SSE\)).
6.1.2 Interpretation
The (unstandardized) intercept estimate \(\hat{\beta}_0\) is the predicted value of \(y\) when \(x = 0\) (all in the original units of \(x\) and \(y\)).
The (unstandardized) slope estimate \(\hat{\beta}_1\) is the predicted change in \(y\) when \(x\) increases by one unit (all in the original units of \(x\) and \(y\)).
The slope estimate can be converted to a standardized slope estimate, which is the predicted change in \(y\) (in standard deviations) when \(x\) increases by one standard deviation.
The standardized intercept is by definition \(0\) (and thus trivial and not reported).
6.1.3 Relation to the correlation coefficient
In simple linear regression, the standardized slope estimate is equal to the Pearson correlation coefficient \(r_{XY}\).
In addition, the coefficient of determination \(R^2\) is here equal to the square of the correlation coefficient, \(r_{XY}^2\).
This is not the case in multiple regression.
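This equivalence can be checked with a sketch (hypothetical data) using `scipy.stats.linregress`:

```python
import numpy as np
from scipy import stats

# Hypothetical predictor and outcome
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

res = stats.linregress(x, y)   # fields: slope, intercept, rvalue, pvalue, stderr
beta1_std = res.slope * x.std(ddof=1) / y.std(ddof=1)  # standardized slope

# beta1_std equals the Pearson correlation r, and rvalue**2 equals R²
```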
6.1.4 Effect size
The most common measure of effect size in this context is the standardized slope estimate / correlation coefficient.
| Standardized Slope (absolute value) | Interpretation |
|---|---|
| 0.00 to 0.09 | Negligible/Null effect size |
| 0.10 to 0.29 | Small/weak effect size |
| 0.30 to 0.49 | Medium/Moderate effect size |
| 0.50 and above | Large/Strong effect size |
6.1.5 Test of the parameters
The intercept and the slope can be tested through (Wald) \(t\) tests, where:
\[ t = \frac{\hat{\beta}}{SE_{\beta}} \]
The null hypothesis is stated as:
\[ H_0 : \beta_\text{population} = 0 \]
The (non-directional) alternate hypothesis is stated as:
\[ H_1 : \beta \neq 0 \]
Under \(H_0\), \(t_\text{sample}\) follows a Student’s \(t\) distribution (centered around \(0\)), with degrees of freedom \(df=N-2\). The observed \(t\) is compared with that distribution, resulting in a \(p\) value.
The \(p\) value indicates the probability to observe a parameter at least as far from \(0\) as \(\hat{\beta}\), if \(H_0\) is true.
If \(p<.05\), we generally say that the intercept or the slope is significant (or significantly different from \(0\)).
If the slope is significant, we often say that the effect of \(x\) is significant.
6.1.6 \(R^2\) and \(F\) test
In this context, the coefficient of determination is redundant with the correlation coefficient, and is therefore rarely presented.
Similarly, the \(F\) test of model fit yields the same \(p\) value as the (two-tailed) Wald \(t\) test of the slope estimate, and is therefore redundant and rarely presented.
Although rarely reported in this context, they can be (see general section on linear models).
6.1.7 Software
Estimates and their tests are generally presented by statistical software in coefficient tables, where each line represents a parameter (here, intercept and slope), and columns indicate the estimate, standard error, \(t\), and \(p\) value (at least).
| | Estimate | SE | t | p |
|---|---|---|---|---|
| (Intercept) | 13.0651 | 0.4310 | 30.3159 | 0.0000 |
| extraversion | -0.1289 | 0.0332 | -3.8797 | 0.0001 |